24 research outputs found

    Data Fusion by Matrix Factorization

    For most problems in science and engineering we can obtain data sets that describe the observed system from various perspectives and record the behavior of its individual components. Heterogeneous data sets can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with contextual data and data about the system's constraints. In the paper we describe a data fusion approach with penalized matrix tri-factorization (DFMF) that simultaneously factorizes data matrices to reveal hidden associations. The approach can directly consider any data that can be expressed in a matrix, including those from feature-based representations, ontologies, associations and networks. We demonstrate the utility of DFMF on a gene function prediction task with eleven different data sources and on the prediction of pharmacologic actions by fusing six data sources. Our data fusion algorithm compares favorably to alternative data integration approaches and achieves higher accuracy than can be obtained from any single data source alone. Comment: Short preprint, 13 pages, 3 Figures, 3 Tables. Full paper in 10.1109/TPAMI.2014.234397
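
To make the idea of matrix tri-factorization concrete, the following is a minimal sketch that approximates a single relation matrix R by a product G1 S G2^T. It is an illustration under stated assumptions, not the DFMF algorithm itself: the penalty terms and the sharing of factors across several data matrices described in the abstract are omitted, the optimizer (gradient descent with step halving) is a stand-in chosen for brevity, and all names (`tri_factorize`, `matmul`, `sq_error`, the toy matrix `R`) are hypothetical.

```python
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def sq_error(R, G1, S, G2):
    """Squared Frobenius norm of R - G1 S G2^T."""
    approx = matmul(matmul(G1, S), transpose(G2))
    return sum((r - a) ** 2 for rr, ar in zip(R, approx) for r, a in zip(rr, ar))


def step(M, grad, lr):
    """One gradient step on matrix M."""
    return [[m - lr * g for m, g in zip(mr, gr)] for mr, gr in zip(M, grad)]


def tri_factorize(R, k1, k2, iters=300, lr=0.05, seed=0):
    """Fit R ~ G1 S G2^T with ranks k1 and k2 (illustrative optimizer only)."""
    rng = random.Random(seed)

    def rand(n, m):
        return [[rng.random() for _ in range(m)] for _ in range(n)]

    G1, S, G2 = rand(len(R), k1), rand(k1, k2), rand(len(R[0]), k2)
    history = [sq_error(R, G1, S, G2)]
    for _ in range(iters):
        approx = matmul(matmul(G1, S), transpose(G2))
        E = [[a - r for a, r in zip(ar, rr)] for ar, rr in zip(approx, R)]
        # Gradients of the squared error (constant factor 2 folded into lr).
        g_G1 = matmul(matmul(E, G2), transpose(S))
        g_S = matmul(matmul(transpose(G1), E), G2)
        g_G2 = matmul(matmul(transpose(E), G1), S)
        cand = (step(G1, g_G1, lr), step(S, g_S, lr), step(G2, g_G2, lr))
        err = sq_error(R, *cand)
        if err < history[-1]:
            G1, S, G2 = cand      # accept the step
            history.append(err)
        else:
            lr *= 0.5             # step too large: halve it and retry
    return G1, S, G2, history


# Toy relation matrix (hypothetical data): 4 objects of one type
# related to 3 objects of another type.
R = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]]
G1, S, G2, history = tri_factorize(R, k1=2, k2=2)
print(f"error before: {history[0]:.3f}, after: {history[-1]:.3f}")
```

In a fusion setting, the same factor (for example G1 for one object type) would appear in the factorization of every matrix that involves that object type, which is what lets information flow between data sources.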

    Learning by Fusing Heterogeneous Data

    It has become increasingly common in science and technology to gather data about systems at different levels of granularity or from different perspectives. This often gives rise to data that are represented in totally different input spaces. A basic premise behind the study of learning from heterogeneous data is that in many such cases, there exists some correspondence among certain input dimensions of different input spaces. In our work we found that a key bottleneck that prevents us from better understanding and truly fusing heterogeneous data at large scales is identifying the kind of knowledge that can be transferred between related data views, entities and tasks. We develop accurate data fusion methods for predictive modeling, which reduce or entirely eliminate some of the basic feature engineering steps that were needed in the past when inferring prediction models from disparate data. In addition, our work has a wide range of applications, of which we focus on those from molecular and systems biology: it can help us predict gene functions, forecast pharmacological actions of small chemicals, prioritize genes for further studies, mine disease associations, detect drug toxicity and regress cancer patient survival data. Another important aspect of our research is the study of latent factor models. We aim to design latent models with factorized parameters that simultaneously tackle multiple types of data heterogeneity, where data diversity spans across heterogeneous input spaces, multiple types of features, and a variety of related prediction tasks. Our algorithms are capable of retaining the relational structure of a data system during model inference, which turns out to be vital for good performance of data fusion in certain applications. Our recent work included the study of network inference from many potentially nonidentical data distributions and its application to cancer genomic data. We also model epistasis, an important concept from genetics, and propose algorithms to efficiently find the ordering of genes in cellular pathways. A central topic of our thesis is also the analysis of large data compendia, as predictions about certain phenomena, such as associations between diseases and involvement of genes in a certain phenotype, are only possible when dealing with large amounts of data. Among others, we analyze 30 heterogeneous data sets to assess drug toxicity and over 40 human gene association data collections, the largest number of data sets considered by a collective latent factor model to date. We also make observations about deciding which data should be considered for fusion and develop a generic approach that can estimate the sensitivities between different data sets.

    A Matrix Factorization Approach for Inference of Prediction Models from Heterogeneous Data Sources

    Today we are witnessing rapid growth of data both in quantity and variety in all areas of human endeavour. Integrative treatment of these sources of information is a major challenge. We propose a new computational method for inference of prediction models. The method uses symmetric penalized matrix tri-factorization and prioritizes predictions by estimating probabilities from matrix factors. The approach represents a new concept of data integration by an intermediate strategy, which is both generally applicable and highly effective and reliable. Major advantages of the approach are an elegant mathematical formulation of the problem, the ability to integrate any kind of data that can be expressed in matrix form, and high predictive accuracy. We tested the effectiveness of the proposed method on predicting gene annotations of the social amoeba D. discoideum. The developed model integrates gene expressions, protein-protein interactions and known gene annotations. The model inferred by the proposed method achieves higher accuracy than standard techniques of early and late integration, which combine inputs and predictions, respectively, and have in the past been favourably reported for their accuracy. With the proposed approach we have also predicted that there are a few genes of D. discoideum that may have a role in bacterial resistance and which were previously not associated with this function. The amoeba is an important model organism, also known for its predation of bacteria, some of which are dangerous to humans and have recently become increasingly resistant to existing antibiotics. Until now, only a handful of genes were known to participate in related bacterial recognition pathways of the amoeba. Our predictions of five new genes were experimentally confirmed in wet-lab experiments at the collaborating institution (Baylor College of Medicine, Houston, USA). Expanding the list of such genes is crucial in the studies of mechanisms for bacterial resistance and can contribute to research on the development of alternative antibacterial therapy.
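
The abstract contrasts its intermediate strategy with early integration (combining inputs) and late integration (combining predictions). The sketch below illustrates only those two baselines, assuming a toy nearest-centroid classifier and made-up feature values; the function names, the classifier and the data are all hypothetical and are not taken from the paper.

```python
def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit(X, y):
    """Toy nearest-centroid model: one centroid per class label (0 or 1)."""
    return {label: centroid([x for x, t in zip(X, y) if t == label])
            for label in set(y)}


def score(model, x):
    """Score in [0, 1]; values near 1 mean x is closer to the class-1 centroid."""
    d0, d1 = sq_dist(x, model[0]), sq_dist(x, model[1])
    return d0 / (d0 + d1) if d0 + d1 > 0 else 0.5


def early_integration(sources, y, query):
    """Concatenate the features of all sources, then train a single model."""
    X = [sum(rows, []) for rows in zip(*sources)]
    return score(fit(X, y), sum(query, []))


def late_integration(sources, y, query):
    """Train one model per source, then average their predictions."""
    scores = [score(fit(X, y), q) for X, q in zip(sources, query)]
    return sum(scores) / len(scores)


# Hypothetical data: two sources describing the same four genes.
expression = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
interactions = [[0.0], [0.1], [1.0], [0.9]]
labels = [0, 0, 1, 1]                # 1 = gene has the annotation
query = [[0.85, 0.85], [0.95]]       # one new gene, one vector per source

early = early_integration([expression, interactions], labels, query)
late = late_integration([expression, interactions], labels, query)
print(early > 0.5, late > 0.5)
```

Early integration loses the boundary between sources once the vectors are concatenated, while late integration never lets the per-source models see each other's data; an intermediate strategy, as described above, sits between the two by building one model that still keeps the sources separate.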

    100 years of the educational system in Lipoglav

    The development of the educational system in Slovenia reflects changes in social and political circumstances and is part of Slovenian culture and national history. The diploma thesis reviews the development of public education in Slovenia and compares it with educational developments in Lipoglav over the last hundred years. The thesis is divided into three major parts. The first part presents the natural, geographical and cultural-historical sights of Lipoglav and its surroundings. The second part describes the historical development of schooling more broadly, and the third part analyzes its impact on a century of education in Lipoglav. For every period, the basic characteristics of the development of the educational system in Slovenia and its inevitable influence on education in Lipoglav are described. The development of education in Lipoglav was strongly shaped by the economic development of the village. Changes in the village (construction of the road and the water supply, renovation of the village, the rise of entrepreneurs, immigration, daily migration of workers to the nearby city) gradually turned Lipoglav and its surroundings from agrarian countryside into a suburb of Ljubljana. A theoretical analysis of sources and literature, accompanied by elements of qualitative research, including semi-structured interviews with former teachers in Lipoglav, demonstrates the influence of the economy on the development of education in Lipoglav, which began in 1865 and continued with the construction of the school building before the Second World War, its reconstruction after the war, and the construction of a new building in recent years. The feminisation of the teaching profession in Lipoglav, which until a few decades ago was geographically remote from the city centre because of poor infrastructure, is also presented through a historical paradigm. A slight decrease in the number of school-age children and their daily migration forced us to introduce an after-school care programme and to build a new school and kindergarten, which relieves parents of part of their care and improves the quality of children's free time outside lessons. Our care for the youngest is reflected in a fully occupied kindergarten and, consequently, within a year or two, in the school as well. Throughout the history of schooling in Lipoglav, the locals have shown that they care about the school and that, together with the author, who is one of them, they will continue to advocate for its existence and keep it as the educational, cultural, sports and recreational centre of the village.

    A Matrix Factorization Approach for Inference of Prediction Models from Heterogeneous Data Sources

    Today we are witnessing rapid growth of data both in quantity and variety in all areas of human endeavour. Integrative treatment of these sources of information is a major challenge. We propose a new computational framework for inference of prediction models based on symmetric penalized matrix tri-factorization and an intermediate strategy for data integration. Major advantages of the approach are an elegant mathematical formulation of the problem, the ability to integrate any kind of data that can be expressed in matrix form, and high predictive accuracy. We tested the effectiveness of the proposed framework on predicting gene annotations of the social amoeba D. discoideum. The developed model integrates gene expressions, protein-protein interactions and known gene annotations. The model achieves higher accuracy than standard techniques of early and late integration, which combine inputs and predictions, respectively, and have in the past been favourably reported for their accuracy. With the proposed approach we have also predicted that there is a set of genes of D. discoideum that may have a role in bacterial resistance and which were previously not associated with this function. Until now, only a handful of genes were known to participate in related bacterial recognition pathways. Expanding the list of such genes is crucial in the studies of mechanisms for bacterial resistance and can contribute to research on the development of alternative antibacterial therapy. Our predictions were experimentally confirmed in wet-lab experiments at the collaborating institution (Baylor College of Medicine, Houston, USA).

    Learning by Fusing Heterogeneous Data

    In recent years, data-intensive procedures in technology and science have made it possible to capture large quantities of heterogeneous data that describe systems at different levels of granularity and from different perspectives. The collected data are often represented in entirely different data domains, which poses a challenge for algorithms that build predictive models by data fusion. Our research rests on the premise that heterogeneous data can be "organized" by establishing suitable mappings between individual dimensions of the input data domains. The bottleneck that separates us from a better understanding of the data domain, and thus from more effective construction of predictive models by fusing large heterogeneous data, is recognizing the kind of information that can be transferred between related data sets, objects of different types and prediction tasks. In the dissertation we propose several interesting and powerful predictive models for learning from heterogeneous data. These approaches are general, achieve high predictive accuracy and are simple to use: they largely avoid the lengthy and demanding data preprocessing relied upon by current models, which most often try to map heterogeneous data into a single data space. The developed algorithms have proved promising in several areas of human activity, but in this work we focus on current problems in molecular and systems biology. These include, among others, predicting gene functions and pharmacological actions, ranking promising genes for further biological research, discovering patterns of associations between diseases, detecting drug toxicity and analyzing mortality. An important aspect of our research is the study of latent factor models. We develop several latent models with factorized parameters that can simultaneously address multiple kinds of data heterogeneity, that is, diversity spanning heterogeneous data domains, multiple entity types and different prediction tasks. The advantage of our algorithms over established approaches is their ability to preserve the dependency structure among the data while predictive models are built, which we verified empirically in several studies. Our recent work in this area comprises approaches for constructing networks by analyzing data from several possibly different data distributions, which we applied to the automatic construction of gene regulatory networks in cancer. We also modeled epistasis, an important concept in genetics. To this end we proposed efficient algorithms for determining the order in which genes act in gene pathways; they use several times fewer computational resources than known techniques. One of the central topics of the doctoral dissertation is the analysis of large data collections. In empirical studies we observed that reliable predictions in bioinformatics, desirable for example when discovering dependencies between diseases and assessing the involvement of genes in various phenotypes, often benefit from reasoning over measurements that originate from different experimental or prior computational procedures. Among other things, we analyze 30 heterogeneous data collections used to assess drug toxicity and more than 40 collections on dependencies between human genes. The latter is the most extensive data collection analyzed in studies of latent factor models to date. Data of this scale raise new questions about choosing suitable data sources for fusion, for which we propose a general approach to estimating sensitivities between sources.

    Gene network inference by fusing data from diverse distributions

    No full text